Part 1: Statistical analysis

Read in CCLE metabolism profiling (Figure 1)

combine with RNA profiling

1) Read in RNA data
2) Data processing: CCDS, missing data, log transform, quantile normalization
3) Merge data
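
The log and quantile-normalization steps in item 2 can be sketched as follows. This is a minimal Python/NumPy illustration under stated assumptions, not the pipeline actually used for these notes: the function name `quantile_normalize`, the pseudocount of 1, and the toy matrix are all hypothetical, and ties are broken by position rather than averaged.

```python
import numpy as np

def quantile_normalize(x):
    """Quantile-normalize the columns (samples) of a genes x samples matrix."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0, kind="stable")      # per-column sort order
    ranks = np.argsort(order, axis=0, kind="stable")  # rank of each value within its column
    # Reference distribution: mean of the k-th smallest value across all samples.
    reference = np.sort(x, axis=0).mean(axis=1)
    return reference[ranks]

# Toy expression matrix (4 genes x 3 samples); values are made up.
expr = np.array([[5.0, 4.0, 3.0],
                 [2.0, 1.0, 4.0],
                 [3.0, 4.0, 6.0],
                 [4.0, 2.0, 8.0]])

log_expr = np.log2(expr + 1.0)      # log transform with an assumed pseudocount of 1
qn = quantile_normalize(log_expr)   # every sample now shares the same distribution
```

After this step every column contains the same set of values, only in a different order, which is the defining property of quantile normalization.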

Merge data

55 media components + 12 clinical info + 225 metabolites + 13403 genes - 1 "Cell_culture_media" = 13694

dat_merge: 454 samples * (55 media components + 12 clinical info + 225 metabolites + 13403 genes - 1 "Cell_culture_media")
dat_metab: 928 samples, CCLE metabolism profiling
dat_rna: 1019 samples, processed CCLE RNA profiling
dat_m_rna: 454 samples * (12 clinical info + 225 metabolites + 13403 genes)
dat: 460 adherent samples * (12 clinical info + 225 metabolites)
target_features: 12 clinical info, c("Original.Source.of.Cell.Line", "growth_properties", "SampleType", "inferred_ethnicity", "Age", "Pathology", "mutRate", "tcga_code", "Cell_culture_media", "FBS.", "Doubling_time_hr", "GI")

metabolite correlation

PCA plot

(Additional) GSEA

The suggestion is to stratify samples by metabolite level (for a given metabolite, take the samples in the top and bottom quartiles), compute each gene's fold change between the high and low groups, and run GSEA on those fold changes to ask whether any metabolic pathways are enriched.
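
The stratification and fold-change step can be sketched as below, assuming log2-scale expression so that a difference of group means is a log2 fold change. The function name `quartile_fold_change` and the toy data are hypothetical; the enrichment step itself is not shown, since the ranked statistic would be handed to whichever preranked GSEA implementation is used.

```python
import numpy as np

def quartile_fold_change(metabolite, expr_log2):
    """Rank genes by high-vs-low log2 fold change for one metabolite.

    metabolite: (n_samples,) metabolite levels
    expr_log2:  (n_samples, n_genes) log2 expression
    """
    metabolite = np.asarray(metabolite, dtype=float)
    expr_log2 = np.asarray(expr_log2, dtype=float)
    q1, q3 = np.percentile(metabolite, [25, 75])
    low = metabolite <= q1     # bottom-quartile samples
    high = metabolite >= q3    # top-quartile samples
    # On log2 data, the difference of group means is the log2 fold change.
    log2_fc = expr_log2[high].mean(axis=0) - expr_log2[low].mean(axis=0)
    ranking = np.argsort(-log2_fc)  # gene indices, most up-regulated in "high" first
    return log2_fc, ranking

# Toy data: 20 samples, 3 genes (gene 0 rises with the metabolite,
# gene 1 falls with it, gene 2 is flat).
metabolite = np.arange(20, dtype=float)
expr_log2 = np.column_stack([metabolite, -metabolite, np.full(20, 5.0)])

log2_fc, ranking = quartile_fold_change(metabolite, expr_log2)
```

In this toy case gene 0 ranks first and gene 1 last, as expected from how the data were built.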

Part 2: Model Training

covariate selection

media

Train a stepwise linear model (y ~ media components + clinical traits) for each metabolite to see how much the media influence the metabolite level.
http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/154-stepwise-regression-essentials-in-r/
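
As a rough illustration of the idea, the sketch below runs forward stepwise selection by AIC with NumPy. It is an assumption-laden stand-in, not the R procedure from the linked tutorial: it only adds terms (no backward steps), the helper names `rss`, `aic`, and `forward_stepwise` are hypothetical, and the data are simulated.

```python
import numpy as np

def rss(design, y):
    """Residual sum of squares of an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

def aic(rss_value, n, k):
    """Gaussian AIC up to an additive constant; k = number of coefficients."""
    return n * np.log(rss_value / n) + 2 * k

def forward_stepwise(X, y):
    """Add one covariate at a time for as long as AIC keeps improving."""
    n, p = X.shape
    design = np.ones((n, 1))                 # start from the intercept-only model
    selected = []
    best = aic(rss(design, y), n, 1)
    while len(selected) < p:
        remaining = [j for j in range(p) if j not in selected]
        scores = [aic(rss(np.column_stack([design, X[:, j]]), y), n, design.shape[1] + 1)
                  for j in remaining]
        i = int(np.argmin(scores))
        if scores[i] >= best:                # no candidate lowers AIC: stop
            break
        best = scores[i]
        selected.append(remaining[i])
        design = np.column_stack([design, X[:, remaining[i]]])
    total = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - rss(design, y) / total        # variance explained by the final model
    return selected, r2

# Simulated stand-in for media components + clinical traits (6 covariates).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
y = 2.0 * X[:, 1] - 1.5 * X[:, 4] + rng.normal(scale=0.1, size=100)

selected, r2 = forward_stepwise(X, y)
```

The returned R2 is the quantity to read per metabolite: a value below 0.2 would mean the selected covariates explain less than 20% of that metabolite's variance.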

We can see that Biotin(VB7), CaCl2, Choline Chloride, CuSO4-5H2O, D-Calcium pantothenate, D-Glucose(D+ Galactose), FeSO4-7H2O, Folic Acid, Glutathione (reduced), Glycine, HEPES, Hypoxanthine Na, and i-Inositol are the media components that most often change metabolite levels. Their impact is small, however: most R2 values are less than 0.2, which means that most metabolite levels are not strongly influenced by the major media components.

The following metabolites may be affected by media components:
'taurocholate', 'glycodeoxycholate/glycochenodeoxycholate', 'taurodeoxycholate/taurochenodeoxycholate', 'glycine', 'serine', 'threonine', 'methionine', 'asparagine', 'histidine', 'arginine', 'lysine', 'valine', 'leucine', 'isoleucine', 'phenylalanine', 'tyrosine', 'tryptophan', 'cis/trans-hydroxyproline', 'ornithine', '5-HIAA', 'thiamine', 'niacinamide', 'choline', 'pyroglutamic-acid', 'methionine-sulfoxide'

select highly correlated genes

specific LASSO regression model
1) Generate a training set and a testing set (460 adherent samples, 7:3).
2) Fit a univariate linear regression model (y ~ gene) to assess the relationship between each metabolite and each gene.
3) Take the genes in the top 5% by R2 as features and train a lasso regression model for each metabolite (to account for multicollinearity).
4) Evaluate on the testing set.
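
The four steps can be sketched end to end as below. This is a minimal NumPy illustration on simulated data, not the code behind these notes: the lasso is a plain coordinate-descent implementation written for the example, and the penalty `alpha=0.05`, the number of simulated genes, and all function names are assumptions. It also assumes no gene has constant expression (which would give a zero denominator).

```python
import numpy as np

def univariate_r2(X, y):
    """R2 of each single-gene model y ~ gene (equal to the squared Pearson r)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return r ** 2

def lasso_cd(X, y, alpha, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + alpha * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.copy()
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]            # remove gene j's current contribution
            rho = X[:, j] @ resid / n
            beta[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]            # add the updated contribution back
    return beta

# Simulated data: 460 samples, 200 genes; genes 3 and 7 drive the metabolite.
rng = np.random.default_rng(0)
n_samples, n_genes = 460, 200
X = rng.normal(size=(n_samples, n_genes))
y = 1.5 * X[:, 3] - 1.0 * X[:, 7] + rng.normal(scale=0.5, size=n_samples)

# 1) 7:3 split into training and testing sets.
idx = rng.permutation(n_samples)
n_train = n_samples * 7 // 10
train, test = idx[:n_train], idx[n_train:]

# 2) Univariate R2 per gene, computed on the training set only.
r2 = univariate_r2(X[train], y[train])

# 3) Keep the top 5% of genes by R2 and fit the lasso on standardized features.
n_keep = max(1, int(np.ceil(0.05 * n_genes)))
top = np.argsort(r2)[::-1][:n_keep]
mu = X[train][:, top].mean(axis=0)
sd = X[train][:, top].std(axis=0)
y_mean = y[train].mean()
beta = lasso_cd((X[train][:, top] - mu) / sd, y[train] - y_mean, alpha=0.05)

# 4) Evaluate on the held-out testing set.
pred = (X[test][:, top] - mu) / sd @ beta + y_mean
test_r2 = 1.0 - ((y[test] - pred) ** 2).sum() / ((y[test] - y[test].mean()) ** 2).sum()
```

Selecting genes and computing the scaling constants on the training samples only keeps the testing set untouched until step 4, so the reported R2 is not inflated by feature selection.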

permutation test (global lasso/lm)
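
A permutation test of this kind can be sketched as follows. To keep the example short, the statistic is the R2 of a single-predictor linear model; in the setting of these notes the full lasso or lm fit would presumably be repeated on each permuted response instead. The function names, the number of permutations, and the simulated data are assumptions.

```python
import numpy as np

def r_squared(x, y):
    """R2 of the simple linear model y ~ x."""
    return np.corrcoef(x, y)[0, 1] ** 2

def permutation_pvalue(x, y, n_perm=1000, seed=0):
    """Compare the observed R2 with R2 values obtained after shuffling y."""
    rng = np.random.default_rng(seed)
    observed = r_squared(x, y)
    null = np.array([r_squared(x, rng.permutation(y)) for _ in range(n_perm)])
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return observed, p_value

# Simulated predictor and response with a real association.
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = x + rng.normal(size=100)

observed, p_value = permutation_pvalue(x, y)
```

Shuffling the response breaks any true link between features and metabolite, so the shuffled R2 values show how well a model can appear to fit by chance alone.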

Part 3: Supporting evidence

Protein lasso models & permutation test

7:3 training/testing split

Pathway model training

GSVA

1) Generate a sub-dataset as the training set (n=454).
2) Use metabolism-pathway-related gene sets (554) as features and train a lasso regression model for each metabolite.
3) Validate in another cohort of cell line samples.

predict a metabolite from another metabolite

xanthine -> xanthosine

MNA prediction model

validation in bulk samples

validation on the most predictable metabolite: MNA

model validation

sample cancer purity check

cell type impact